Network Fault Tolerance in LA-MPI

نویسندگان

  • Rob T. Aulwes
  • David J. Daniel
  • Nehal N. Desai
  • Richard L. Graham
  • L. Dean Risinger
  • Mitchel W. Sukalski
  • Mark A. Taylor
چکیده

LA-MPI is a high-performance, network-fault-tolerant implementation of MPI designed for terascale clusters that are inherently unreliable due to their very large number of system components and to trade-offs between cost and performance. This paper reviews the architectural design of LA-MPI, focusing on our approach to guaranteeing data integrity. We discuss our network data path abstraction that makes LA-MPI highly portable, gives high-performance through message striping, and most importantly provides the basis for network fault tolerance. Finally we include some performance numbers for Quadrics Elan, Myrinet GM and UDP network data paths.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

High Performance Broadcast Support in La-Mpi Over Quadrics

LA-MPI is a unique MPI implementation that provides network-level fault-tolerant message passing. This paper describes the efficient implementation of a scalable MPI broadcast algorithm. LA-MPI implements a generic version of the broadcast algorithm using a spanning tree method built on top of point-to-point messaging. However, the Quadrics network, with it’s hardware broadcast support, provide...

متن کامل

Implementing a Hardware-Based Barrier in Open MPI

Open MPI is a recent open source development project which combines features of different MPI implementations. These features include fault tolerance, multi network support, grid support and a component architecture which ensures extensibility. The TUC Hardware Barrier is a special purpose low latency barrier network based on commodity hardware. We show that the Open MPI collective framework ca...

متن کامل

Fault Tolerance in MPI Programs

This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, ...

متن کامل

Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications

Future high-performance computing systems may face frequent failures with their rapid increase in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, which demands fault tolerance support for prevalent MPI applications. Among failure scenarios, process failures are one of the most severe issues as they usually lead to t...

متن کامل

Failure Resilient Heterogeneous Parallel Computing Across Multidomain Clusters

We propose lightweight middleware solutions that facilitate and simplify the execution of failure-resilient MPI programs across multidomain clusters. The system described in this paper leverages H2O, a distributed metacomputing framework, to route MPI message passing across heterogeneous aggregates located in different administrative or network domains. MPI programs instantiate a specially writ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003